NRG-GY003 WES Analysis

Whole Exome genomic analysis of NRG-GY003 ovarian cancer tumors

Genomic analysis of tumor exomes from patients with recurrent ovarian cancer treated with immune checkpoint inhibitors (NIVO or IPI/NIVO) treated on the NRG-GY003 clinical trial.
Author
Affiliation

Kelsey Monson, PhD, MS

Icahn School of Medicine at Mount Sinai

Published

August 1, 2025

Keywords

Genomics, Whole Exome, Immunotherapy, R, Ovarian Cancer

Intro

Variant Filtering Strategy

I first used the Nextflow Sarek pipeline to align the raw fastq files. Below is the workflow I used:

Note that, at this time, I have only merged the variants from the SNP/indel callers, as ASCAT and the other SV/CNV callers do not have .vcfs as their output.

Here is the detailed variant filtering strategy I used to come up with the MAF file used in this analysis:

  • First find consensus somatic calls:

    • Present in >=2 somatic callers (Freebayes, Mutect2, Strelka2)
    • Include variants annotated with PASS
      • PASS: indicates the variant passed all quality filters applied by the variant caller

      • Considered the gold-standard for high-quality variants passing the variant caller’s built-in quality control filters

    • Further refine PASS variants
      • Minimum tumor Read Depth = 20: ensures sufficient coverage (i.e. evidence) to confidently call tumor variants from normal and/or distinguish from random sequencing errors

      • Allele frequency 0.05 < AF < 0.95:

        • >0.05: also helps avoid random sequencing errors/miss-called variants

        • <0.95: helps avoid germline contamination

  • Identify rare pathogenic germline variants:

    • Present in >=2 germline callers (Freebayes, Haplotypecaller, Strelka2)

    • Identify those most likely to be pathogenic

      • Limit to protein-coding variants (drop all intergenic and non-coding variants)

      • Must be annotated with HIGH impact

    • Eliminate common variants

      • “Rare” variants are typically those present in <1% of the population

      • Because I was worried about missing variants, I initially excluded those with >5% population frequency, but this was too liberal.

      • I revised the script to set the threshold to 1% and will re-generate the MAF file eventually

      • For the current analysis, I QC’d the variants and only identified likey oncogenic variants in BRCA1/2, so limited the germline analysis to those genes.

    • Include all likely oncogenes regardless of population frequency (“BRCA1” “BRCA2” “TP53” “PTEN” “ATM” “CHEK2” “PALB2” “APC” “MUTYH”)

  • Generated consensus MAF file

    • The above analysis generated two vcfs, one with the processed somatic variants and one with likely pathogenic germline variants

    • These variants were annotated as SOMATIC or GERMLINE_PATHOGENIC, respectively in the vcfs

    • I then concatenated these variants to one MAF file for the entire sample using the Nextflow vcf2maf pipeline.

This consensus MAF is what is used for the downstream analysis presented here.

Loading in the data

Packages Used
# Load packages

library(maftools) # For majority of maf file analysis
library(mclust)
library("BSgenome.Hsapiens.UCSC.hg38", quietly = TRUE)
library(NMF) # For signature calculation
library(pheatmap) # For nice heatmaps
library(ghibli) # For pretty colors
library(gt) # For nice tables (maybe?)

Load in raw MAF file and the clinical data

laml = read.maf(maf="input/merged_consensus_all_samples_germline.maf",
                clinicalData ="input/NRG-GY00_laml_annot_2.csv",
                rmFlags = TRUE # "FLAGS, frequently mutated genes in public exomes"
                )
-Reading
-Validating
-Removing 20 FLAG genes
-Silent variants: 30668 
-Summarizing
-Processing clinical data
-Finished in 3.380s elapsed (3.110s cpu) 
Tip

Setting rmFlags = TRUE removes frequently mutated genes in public exomes

Details on these genes can be found here.

Data cleaning: setting levels for factor variables
# RECIST response (CR, PR, SD, PD)
laml@clinical.data$response <- factor(laml@clinical.data$response, levels=c("CR","PR","SD","PD"))
#table(laml@clinical.data$response)

# Detailed response including durable and progressive SD
laml@clinical.data$response2 <- factor(laml@clinical.data$response2, levels=c("CR","PR","SD_CB","SD_NCB","PD"))
# table(laml@clinical.data$response2)

# Response by treatment
laml@clinical.data$response_tx <- factor(laml@clinical.data$response_tx, levels=c("Nivo_CB","Combo_CB","Nivo_NCB","Combo_NCB"))
#table(laml@clinical.data$response_tx)

# Race (White, Non-White, and Not Reported)
laml@clinical.data$race2 <- factor(laml@clinical.data$race2, levels=c("W","NW","NR"))
#table(laml@clinical.data$race2)

# Site (dichotomous)
laml@clinical.data$Site2 <- factor(laml@clinical.data$Site2, levels=c("Adnexa","Non_Adnexa"))
#table(laml@clinical.data$Site2)

# Primary/Met
laml@clinical.data$Primary_Met <- factor(laml@clinical.data$Primary_Met, levels=c("P","M"))
#table(laml@clinical.data$Primary_Met)

Subsetting datasets

We have some normal samples with no matched tumor, and we may be interested in doing subtype-specific analysis.

Let’s subset to only tumor samples, and then create further subsetted datasets based on clinical characteristics.

Only tumor samples

I also annotated with pathogenic germline variants; the only relevant ones were in BRCA1 and BRCA2, so we are subsetting to somatic mutations and pathogenic germline variants.

laml_tum = subsetMaf(maf = laml, clinQuery = "Tissue %in% 'Tu'")
-subsetting by clinical data..
--86 samples meet the clinical query
-Processing clinical data
laml_tum = subsetMaf(maf = laml_tum, query = "vcf_id %in% c('SOMATIC','GERMLINE_PATHOGENIC') ")
-Processing clinical data

Other subsets

We can also subset by tumor histology, treatment, and response.

## Comparing Serous vs others
laml_ser = subsetMaf(maf = laml_tum, clinQuery = "celltype %in% 'Serous_Adenocarcinoma'")
-subsetting by clinical data..
--71 samples meet the clinical query
-Processing clinical data
# Subsetting the other dataset to "not serous"
`%ni%` <- Negate(`%in%`)
laml_other = subsetMaf(maf = laml_tum, clinQuery = "celltype %ni% 'Serous_Adenocarcinoma'")
-subsetting by clinical data..
--15 samples meet the clinical query
-Processing clinical data
## Comparing CB vs NCB
laml_CB = subsetMaf(maf = laml_tum, clinQuery = "responseCB %in% 'CB'")
-subsetting by clinical data..
--29 samples meet the clinical query
-Processing clinical data
laml_NCB = subsetMaf(maf = laml_tum, clinQuery = "responseCB %in% 'NCB'")
-subsetting by clinical data..
--57 samples meet the clinical query
-Processing clinical data
## Subsetting by treatment to see if there are differences
### There shouldn't be any, since treatment assignment was randomized, but is is good to confirm.
laml_nivo = subsetMaf(maf = laml_tum, clinQuery = "tx %in% 'Nivo'")
-subsetting by clinical data..
--43 samples meet the clinical query
-Processing clinical data
laml_combo = subsetMaf(maf = laml_tum, clinQuery = "tx %in% 'Combo'")
-subsetting by clinical data..
--43 samples meet the clinical query
-Processing clinical data

View the MAF file summary

Here is a basic summary of the MAF file

laml_tum
An object of class  MAF 
                        ID summary    Mean Median
                    <char>  <char>   <num>  <num>
 1:             NCBI_Build  GRCh38      NA     NA
 2:                 Center       .      NA     NA
 3:                Samples      86      NA     NA
 4:                 nGenes    6897      NA     NA
 5:        Frame_Shift_Del     234   2.721    2.0
 6:        Frame_Shift_Ins      43   0.500    0.0
 7:           In_Frame_Del      83   0.965    0.0
 8:           In_Frame_Ins       6   0.070    0.0
 9:      Missense_Mutation    9193 106.895   81.0
10:      Nonsense_Mutation     524   6.093    4.0
11:       Nonstop_Mutation      13   0.151    0.0
12:            Splice_Site     505   5.872    3.0
13: Translation_Start_Site      19   0.221    0.0
14:                  total   10620 123.488   92.5
These are some more summaries you can generate
# I'm commenting them out as they have too much info for the tables to be readable.

# #Shows sample summary.
# getSampleSummary(laml_tum)

# #Shows gene summary.
# getGeneSummary(laml_tum)

# #shows clinical data associated with samples
# getClinicalData(laml_tum)

# #Shows all fields in MAF
# getFields(laml_tum)

Visual Summaries

Summarize the maf file

We can use plotmafSummary to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification.

plotmafSummary(maf = laml_tum, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

#Use mafbarplot for a minimal barplot of mutated genes.
mafbarplot(maf =  laml_tum)

Summarize the other subsets

Here are the mutation distributions for the other subsets.

As a sanity check, the majority of Serous samples have TP53 mutations, while there are few TP53 mutations in the top 10 for the non-Serous samples (as is expected).

By Histology

Serous

## Serous 
plotmafSummary(maf = laml_ser, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_ser)

Non-Serous

## Non-serous
plotmafSummary(maf = laml_other, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_other)

By Response

CB

## CB
plotmafSummary(maf = laml_CB, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_CB)

NCB

## NCB
plotmafSummary(maf = laml_NCB, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_NCB)

By Treatment

Nivo

## Nivo
plotmafSummary(maf = laml_nivo, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_nivo)

Combo

## Combo
plotmafSummary(maf = laml_combo, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf =  laml_combo)

Oncoprints

Oncoprints (or “oncoplots” as the wrapper function in maftools is called) summarize complex genomic features of a given dataset.

#oncoplot for top ten mutated genes.
oncoplot(maf = laml_tum, top = 10)

These are the key features and how they are interpreted:

  • Columns represent individual patients.
    • Reading column-wise, you can see the collection of alterations in a patient’s tumor for the given set of genes.
    • The bar plot on the top is a count of tumor mutation burden per patient, color-coded by mutation type.
  • Rows are altered genes.
    • Alterations are color-coded to indicate their type (e.g. mutation, deletion, insertion)
    • Genes can be further annotated with key features (e.g. high impact/likely pathogenic mutations, germline variants, etc.)
  • The bar plot on the right summarizes alterations in a given gene for the entire dataset.
  • Many more features and annotations can be added to further characterize the dataset.
#KelseyNote

Something to be aware of, especially as there are a number of mucins that are coming up, is that some samples were tumor-only and some were matched normal; would be good to annotate the metadata with this info and see if the mucins are coming up more in the tumor-only samples, since these genes tend to be highly polymorphic.

Also re: mucins, MUC16 is one of the genes that’s filtered out when using the FLAGS filter (it’s #2 on the list of top mutated genes). If we include it, it’s the 13th most mutated gene, but there are only 9 patients who have mutations.

Questions

  1. Do we care about MUC16 mutations?

Chatted with Yanis about this, we might be interested in this, at least in relation to his project for where the mutations are coming from (if they’re all in the cleaved region, we don’t care so much, but if they’re in the membrane-bound region, this could be a problem). He will see where his antibody binds, and I will work on making a lollipop plot in our data. Based on cBioPortal, all the ovarian cancer mutations are in the extracellular region, not the transmembrane or cytoplasmic regions (which is good).

  1. Should we be excluding the other mucins from the analysis? (And does it matter beyond individual gene-level analysis? It could affect multiple testing adjustments.)
# Highlight by specific attribute in MAF file
oncoplot(maf = laml_tum, top = 10,
         additionalFeature = c("IMPACT", "HIGH"))

# Add transitions/transversions  
oncoplot(maf = laml_tum, top = 10, draw_titv = TRUE)

# cBioPortal alteration nomenclature 
oncoplot(maf = laml_tum, top = 10, cBioPortal = TRUE)

Oncoprints by clinical data

## Primary vs metastatic sites
oncoplot(maf = laml_tum, clinicalFeatures = 'Primary_Met', sortByAnnotation = TRUE)

## Cell type
oncoplot(maf = laml_tum, clinicalFeatures = 'celltype', sortByAnnotation = TRUE)

## ICI response
oncoplot(maf = laml_tum, clinicalFeatures = c('responseCB','response2'), sortByAnnotation = TRUE)

## ICI response and treatment
oncoplot(maf = laml_tum, clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

## Mutational signature
oncoplot(maf = laml_tum, clinicalFeatures = 'Signature', sortByAnnotation = TRUE)

## Race
oncoplot(maf = laml_tum, clinicalFeatures = 'race', sortByAnnotation = TRUE)

## Site
oncoplot(maf = laml_tum, clinicalFeatures = c('Site2','Site'), sortByAnnotation = TRUE)

## Stratified by histology
## Serous
oncoplot(maf = laml_ser, clinicalFeatures = c('Site2','Site','celltype'), sortByAnnotation = TRUE)

## Non-Serous
oncoplot(maf = laml_other, clinicalFeatures = c('Site2','Site','celltype'), sortByAnnotation = TRUE)

Oncogenic signaling pathways

sigpw plots enrichment for known oncogenic signaling pathways as defined in TCGA, Sanchez/Vega et al.

# Plot genes in 2 top oncogenic signaling pathways
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 2)

# Collapse to just plot the pathways, now top 5
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE)

Pathways by Response

# Response
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE)

# Response by treatment
oncoplot(maf = laml_tum, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

Pathways by Histology

Since these are the top pathways for the whole cohort, they may be masking histologically-specific pathways.

This is where it is useful to subset the data. Here I am plotting the top pathways for serous and all non-serous patients separately.

# Serous
oncoplot(maf = laml_ser, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

# Non-serous 
oncoplot(maf = laml_other, pathways = "sigpw", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

Biological processes of known tumor drivers

smgbp plots enrichment for pan-cancer significantly mutated genes, classified by biological processes, per Bailey et al.

Here are the top 2 biological processes of known drivers

# Note that I have highlighted the BRCA1/2 germline variants
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2, 
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

Pathways by Response & Histology

# Response
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Detailed response (note that none of the germline variants had CR or PR)
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'response2', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Response by treatment 
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

# Histology
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'celltype', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

Pathways by mutational signature*

oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.8, topPathways = 2,
         clinicalFeatures = 'Signature', sortByAnnotation = TRUE,
         additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC"))

#KelseyNote

*Will need to return to this, but it seems like sig1 is more BRCA1 while sig2 is more BRCA2.

I need to re-call the mutational signatures (downstream of these analyses in my original script) and update this section.

Top pathways

Now collapsing the plots to show only the pathways

# Top 5 pathways 
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 5, 
         collapsePathway = TRUE)

# Top 10 pathways*
# *I've set `topPathways` = 10 but this is 24 pathways, not sure what's going on there...
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE)

# By response
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'responseCB', sortByAnnotation = TRUE)

# By response and treatment
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'response_tx', sortByAnnotation = TRUE)

# By histology
oncoplot(maf = laml_tum, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype', sortByAnnotation = TRUE)

Top pathways by histology

As with the oncogenic signaling pathways, I am plotting these separately for serous and non-serous.

# Serous
oncoplot(maf = laml_ser, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10, 
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

# Non-serous
oncoplot(maf = laml_other, pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 10,
         collapsePathway = TRUE,
         clinicalFeatures = 'celltype')

Additional oncoprints with more complex annotations

Code
## Color coding
ghibli_colors <- ghibli_palette("PonyoMedium", type = "discrete")
# ghibli_colors
respcolors <- c("#278B9AFF","#E75B64FF")
names(respcolors) = c("CB","NCB")
fabcolors = RColorBrewer::brewer.pal(n = 5,name = 'Spectral')
names(fabcolors) = c("CR", "PR", "SD_CB", "SD_NCB", "PD")
txcolors = c("#BEAED4","#7FC97F")
names(txcolors) = c("Nivo","Combo")
cellcolors = RColorBrewer::brewer.pal(n = 6,name = 'PRGn')
names(cellcolors) = c("Serous_Adenocarcinoma","Endometrioid_Adenocarcinoma","Adenocarcinoma",
                      "Clear_Cell","Mixed_Epithelial","Undifferentiated_Carcinoma")


anno_cols = list(tx = txcolors, response2 = fabcolors, Survtime = "Blues", celltype = cellcolors)
anno_cols2 = list(tx = txcolors, responseCB = respcolors, Survtime = "Blues", celltype = cellcolors)
#print(anno_cols2)
oncoplot(
  maf = laml_tum,
  clinicalFeatures = c('response2','tx',  'Survtime','celltype'),
  sortByAnnotation = TRUE,
  annotationColor = anno_cols
)

oncoplot(
  maf = laml_tum,
  clinicalFeatures = c('responseCB','tx','Survtime','celltype'),
  sortByAnnotation = TRUE,
  draw_titv = TRUE,
  annotationColor = anno_cols2,
  pathways = "smgbp", gene_mar = 8, fontSize = 0.6, topPathways = 2,
  additionalFeature = c("vcf_id", "GERMLINE_PATHOGENIC")
)

Other Visualizations

Transitions and transversions

Transitions are changes between two purines (A <-> G) or two pyrimidines (C <-> T), involving bases of simliar shapes.

Transversions are changes from a purine to a pyrimidine or vice versa, resulting in the exchange of one-ring and two-ring structures.

Here’s a nice picture from this page:

Transitions tend to be more common than transversions, despite there being twice as many possible transversions.

The titiv function summarizes these for the dataset:

laml.titv = titv(maf = laml_tum, plot = FALSE, useSyn = TRUE)

#plot titv summary
plotTiTv(res = laml.titv, plotNotch = TRUE)

Here are the plots separated by histology:

# Serous
laml.titv.ser = titv(maf = laml_ser, plot = FALSE, useSyn = TRUE)
plotTiTv(res = laml.titv.ser, plotNotch = TRUE)

# Non-Serous
laml.titv.other = titv(maf = laml_other, plot = FALSE, useSyn = TRUE)
plotTiTv(res = laml.titv.other)

Lollipop plots

Draws a lollipop plot of amino acid changes on a protein structure (protein domains are derived from the PFAM database).

Let’s look at the plot for TP53 (the most mutated gene in overall ovarian cohort)

# Lollipop plot for TP53
lollipopPlot(
  maf = laml_tum,
  gene = 'TP53',
  showMutationRate = TRUE
)
     HGNC    refseq.ID   protein.ID aa.length
   <char>       <char>       <char>     <num>
1:   TP53    NM_000546    NP_000537       393
2:   TP53 NM_001126112 NP_001119584       393
3:   TP53 NM_001126113 NP_001119585       346
4:   TP53 NM_001126114 NP_001119586       341
5:   TP53 NM_001126115 NP_001119587       261
6:   TP53 NM_001126116 NP_001119588       209
7:   TP53 NM_001126117 NP_001119589       214
8:   TP53 NM_001126118 NP_001119590       354

However, we know it’s the most commonly mutated gene because the majority of the samples are Serous.

This is what it looks like stratified by histology:

# Serous
lollipopPlot(
  maf = laml_ser,
  gene = 'TP53',
  showMutationRate = TRUE
)
     HGNC    refseq.ID   protein.ID aa.length
   <char>       <char>       <char>     <num>
1:   TP53    NM_000546    NP_000537       393
2:   TP53 NM_001126112 NP_001119584       393
3:   TP53 NM_001126113 NP_001119585       346
4:   TP53 NM_001126114 NP_001119586       341
5:   TP53 NM_001126115 NP_001119587       261
6:   TP53 NM_001126116 NP_001119588       209
7:   TP53 NM_001126117 NP_001119589       214
8:   TP53 NM_001126118 NP_001119590       354

# Non-Serous
lollipopPlot(
  maf = laml_other,
  gene = 'TP53',
  showMutationRate = TRUE
)
     HGNC    refseq.ID   protein.ID aa.length
   <char>       <char>       <char>     <num>
1:   TP53    NM_000546    NP_000537       393
2:   TP53 NM_001126112 NP_001119584       393
3:   TP53 NM_001126113 NP_001119585       346
4:   TP53 NM_001126114 NP_001119586       341
5:   TP53 NM_001126115 NP_001119587       261
6:   TP53 NM_001126116 NP_001119588       209
7:   TP53 NM_001126117 NP_001119589       214
8:   TP53 NM_001126118 NP_001119590       354

What about ARID1A? Here is the full cohort:

# Lollipop plot for ARID1A
lollipopPlot(
  maf = laml_tum,
  gene = 'ARID1A',
  showMutationRate = TRUE
)
     HGNC refseq.ID protein.ID aa.length
   <char>    <char>     <char>     <num>
1: ARID1A NM_006015  NP_006006      2285
2: ARID1A NM_139135  NP_624361      2068

And stratified by histology:

# Serous
lollipopPlot(
  maf = laml_ser,
  gene = 'ARID1A',
  showMutationRate = TRUE
)
     HGNC refseq.ID protein.ID aa.length
   <char>    <char>     <char>     <num>
1: ARID1A NM_006015  NP_006006      2285
2: ARID1A NM_139135  NP_624361      2068

# Non-Serous
lollipopPlot(
  maf = laml_other,
  gene = 'ARID1A',
  showMutationRate = TRUE
)
     HGNC refseq.ID protein.ID aa.length
   <char>    <char>     <char>     <num>
1: ARID1A NM_006015  NP_006006      2285
2: ARID1A NM_139135  NP_624361      2068

#KelseyNote

I need to better understand what specifically is being plotted here; based on cBioPortal, and because this is the protein that’s being shown, it’s only the exons, but I need to better understand these functional regions that are being highlighted.

The x-axis here is amino acids, not base pairs.

Rainfall plots

Rainfall plots illustrate “hypermutated” genomic regions.

Setting detectChangePoints = TRUE detects genomic change points where potential kataegis are found.

Kataegis are genomic segments containing six or more consecutive mutations with an average inter-mutation distance of <=1,000 bp.

From the maftools help page for rainfallPlot():

Kategis detection algorithm by Moritz Goretzky at WWU Munster, which exploits the definition of Kategis (six consecutive mutations with an avg. distance of 1000 bp ) to identify hyper mutated genomic loci.

  • Algorithm starts with a double-ended queue to which six consecutive mutations are added and their average inter-mutation distance is calculated.
  • If the average inter-mutation distance is larger than 1000, one element is added at the back of the queue and one is removed from the front.
  • If the average inter-mutation distance is less or equal to 1000, further mutations are added until the average inter-mutation distance is larger than 1000.
  • After that, all mutations in the double-ended queue are written into output as one kataegis and the double-ended queue is reinitialized with six mutations.

By default, it plots the most mutated sample.

rainfallPlot(
  maf = laml_tum, 
  detectChangePoints = TRUE, 
  pointSize = 0.4,
  ref.build = "hg38"
)
Processing GADCHA..
No changepoints detected!

# Here's a sample with kataegis
rainfallPlot(
  maf = laml_tum, 
  detectChangePoints = TRUE, 
  pointSize = 0.4,
   tsb = "GADGAE",
  ref.build = "hg38"
 )
Processing GADGAE..
Kataegis detected at:
   Chromosome Start_Position End_Position nMuts Avg_intermutation_dist  Size
        <num>          <num>        <num> <int>                  <num> <num>
1:         11        1179999      1182583     6                  516.8  2584
   Tumor_Sample_Barcode   C>G   C>T   T>C
                 <fctr> <int> <int> <int>
1:               GADGAE     1     3     2

#KelseyNote

I need to think of a good way to evaluate these for all samples; maybe just a script for only this for each sample and then summarize?

Think about this…

Compare TMB to TCGA

## Capture size for Twist Bioscience Human Comprehensive Exome kit (used for WES) is 36.8Mb
laml.mutload = tcgaCompare(maf = laml_tum, cohortName = 'NRG-GY003', logscale = TRUE, capture_size = 36.8)

# Just in serous
laml.mutload = tcgaCompare(maf = laml_ser, cohortName = 'NRG-GY003', logscale = TRUE, capture_size = 36.8)

# Other
laml.mutload = tcgaCompare(maf = laml_other, cohortName = 'NRG-GY003', logscale = TRUE, capture_size = 36.8)

#KelseyNote

This is much better than when I used the Genewiz analysis (where they likely used a common reference genome and called everything outside that a somatic mutation), but this still seems a bit high compared to the ovarian cohort.

I know most of the TCGA samples are primaries, but I guess most of these are too, so it should be fairly comparable. Need to consider how much of a problem this is (or isn’t).

Analysis

Somatic interactions

Mutually exclusive or co-occurring set of genes can be detected using the somaticInteractions function, which performs pair-wise Fisher’s Exact test to detect such significant pair of genes.

# Exclusive/co-occurance event analysis on top 10 mutated genes
## Full cohort
somaticInteractions(maf = laml_tum, top = 25, pvalue = c(0.05, 0.1))

         gene1     gene2       pValue oddsRatio    00    01    11    10
        <char>    <char>        <num>     <num> <int> <int> <int> <int>
  1: GOLGA6L10     MUC3A 1.488870e-09       Inf    78     1     7     0
  2:  PRAMEF18    GTPBP6 5.241201e-08 197.81731    76     1     7     2
  3:      TP53  PRAMEF18 1.019193e-05   0.00000    18     9     0    59
  4:    GTPBP6 GOLGA6L10 3.170550e-05  54.07523    76     2     5     3
  5:    GTPBP6  GOLGA6L9 3.170550e-05  54.07523    76     2     5     3
 ---                                                                   
296:    MYO18B GOLGA6L10 1.000000e+00   0.00000    72     7     0     7
297:     TENM3 GOLGA6L10 1.000000e+00   0.00000    72     7     0     7
298:    MYO18B  GOLGA6L9 1.000000e+00   0.00000    72     7     0     7
299:     SVEP1  GOLGA6L9 1.000000e+00   0.00000    72     7     0     7
300:     TENM3  GOLGA6L9 1.000000e+00   0.00000    72     7     0     7
             pAdj              Event              pair event_ratio
            <num>             <char>            <char>      <char>
  1: 3.446459e-08       Co_Occurence  GOLGA6L10, MUC3A         7/1
  2: 1.129569e-06       Co_Occurence  GTPBP6, PRAMEF18         7/3
  3: 2.054825e-04 Mutually_Exclusive    PRAMEF18, TP53        0/68
  4: 5.661696e-04       Co_Occurence GOLGA6L10, GTPBP6         5/5
  5: 5.661696e-04       Co_Occurence  GOLGA6L9, GTPBP6         5/5
 ---                                                              
296: 1.000000e+00 Mutually_Exclusive GOLGA6L10, MYO18B        0/14
297: 1.000000e+00 Mutually_Exclusive  GOLGA6L10, TENM3        0/14
298: 1.000000e+00 Mutually_Exclusive  GOLGA6L9, MYO18B        0/14
299: 1.000000e+00 Mutually_Exclusive   GOLGA6L9, SVEP1        0/14
300: 1.000000e+00 Mutually_Exclusive   GOLGA6L9, TENM3        0/14
## Serous only
somaticInteractions(maf = laml_ser, top = 25, pvalue = c(0.05, 0.1))

      gene1  gene2       pValue oddsRatio    00    01    11    10        pAdj
     <char> <char>        <num>     <num> <int> <int> <int> <int>       <num>
  1:   TP53 LILRB3 4.838743e-06   0.00000     8     7     0    56 0.000112008
  2: MYO18B  DNAH1 2.205783e-04  51.73280    63     2     4     2 0.004753842
  3: PLXNB2    EYS 6.320328e-03  18.60432    62     3     3     3 0.119703175
  4:  RIMS1 PLXNB2 6.320328e-03  18.60432    62     3     3     3 0.119703175
  5:  HMCN2 LOXHD1 1.068399e-02  14.02065    61     4     3     3 0.180472880
 ---                                                                         
296:   PEG3  MYOM2 1.000000e+00   0.00000    59     6     0     6 1.000000000
297: PLXNB2  MYOM2 1.000000e+00   0.00000    59     6     0     6 1.000000000
298:  RIMS1  MYOM2 1.000000e+00   0.00000    59     6     0     6 1.000000000
299: PLXNB2   PEG3 1.000000e+00   0.00000    59     6     0     6 1.000000000
300:  RIMS1   PEG3 1.000000e+00   0.00000    59     6     0     6 1.000000000
                  Event          pair event_ratio
                 <char>        <char>      <char>
  1: Mutually_Exclusive  LILRB3, TP53        0/63
  2:       Co_Occurence DNAH1, MYO18B         4/4
  3:       Co_Occurence   EYS, PLXNB2         3/6
  4:       Co_Occurence PLXNB2, RIMS1         3/6
  5:       Co_Occurence HMCN2, LOXHD1         3/7
 ---                                             
296: Mutually_Exclusive   MYOM2, PEG3        0/12
297: Mutually_Exclusive MYOM2, PLXNB2        0/12
298: Mutually_Exclusive  MYOM2, RIMS1        0/12
299: Mutually_Exclusive  PEG3, PLXNB2        0/12
300: Mutually_Exclusive   PEG3, RIMS1        0/12
## Non-Serous
somaticInteractions(maf = laml_other, top = 25, pvalue = c(0.05, 0.1))

        gene1     gene2       pValue oddsRatio    00    01    11    10
       <char>    <char>        <num>     <num> <int> <int> <int> <int>
  1:   MUC5AC    GTPBP6 0.0003330003       Inf    10     0     5     0
  2: PRAMEF18  GOLGA6L9 0.0007326007       Inf    11     0     4     0
  3:     TP53     BIRC6 0.0021978022       Inf    12     0     3     0
  4:    ZBED3 GOLGA6L10 0.0021978022       Inf    12     0     3     0
  5: GOLGA6L9    GTPBP6 0.0036630037       Inf    10     1     4     0
 ---                                                                  
296:  CCDC187     ABCA4 1.0000000000         0    11     2     0     2
297:  CCDC187      ABL2 1.0000000000         0    11     2     0     2
298:   HECTD4   CCDC187 1.0000000000         0    11     2     0     2
299:   IFT140   CCDC187 1.0000000000         0    11     2     0     2
300:     SLX4   CCDC187 1.0000000000         0    11     2     0     2
           pAdj              Event               pair event_ratio
          <num>             <char>             <char>      <char>
  1: 0.03468753       Co_Occurence     GTPBP6, MUC5AC         5/0
  2: 0.03815629       Co_Occurence GOLGA6L9, PRAMEF18         4/0
  3: 0.05283178       Co_Occurence        BIRC6, TP53         3/0
  4: 0.05283178       Co_Occurence   GOLGA6L10, ZBED3         3/0
  5: 0.06733463       Co_Occurence   GOLGA6L9, GTPBP6         4/1
 ---                                                             
296: 1.00000000 Mutually_Exclusive     ABCA4, CCDC187         0/4
297: 1.00000000 Mutually_Exclusive      ABL2, CCDC187         0/4
298: 1.00000000 Mutually_Exclusive    CCDC187, HECTD4         0/4
299: 1.00000000 Mutually_Exclusive    CCDC187, IFT140         0/4
300: 1.00000000 Mutually_Exclusive      CCDC187, SLX4         0/4
## CB
somaticInteractions(maf = laml_CB, top = 25, pvalue = c(0.05, 0.1))

       gene1     gene2      pValue oddsRatio    00    01    11    10       pAdj
      <char>    <char>       <num>     <num> <int> <int> <int> <int>      <num>
  1: ADPRHL1    ADGRG4 0.002736727       Inf    24     2     3     0 0.06335016
  2:   OPLAH   DYNC2H1 0.004252453   46.9969    24     1     3     1 0.08053887
  3:  MUC5AC     HMCN2 0.004252453   46.9969    24     1     3     1 0.08053887
  4:    NRDC     HMCN2 0.004252453   46.9969    24     1     3     1 0.08053887
  5:  ARID1A    IFT140 0.005473454       Inf    23     0     3     3 0.09774025
 ---                                                                           
296:   LRP1B    MUC5AC 1.000000000    0.0000    22     4     0     3 1.00000000
297:   ZFHX4      NRDC 1.000000000    0.0000    21     4     0     4 1.00000000
298:   ABCA4      NRDC 1.000000000    0.0000    22     4     0     3 1.00000000
299:   ZFHX4 SPATA31H1 1.000000000    0.0000    21     4     0     4 1.00000000
300:   ITPR3 SPATA31H1 1.000000000    0.0000    22     4     0     3 1.00000000
                  Event             pair event_ratio
                 <char>           <char>      <char>
  1:       Co_Occurence  ADGRG4, ADPRHL1         3/2
  2:       Co_Occurence   DYNC2H1, OPLAH         3/2
  3:       Co_Occurence    HMCN2, MUC5AC         3/2
  4:       Co_Occurence      HMCN2, NRDC         3/2
  5:       Co_Occurence   ARID1A, IFT140         3/3
 ---                                                
296: Mutually_Exclusive    LRP1B, MUC5AC         0/7
297: Mutually_Exclusive      NRDC, ZFHX4         0/8
298: Mutually_Exclusive      ABCA4, NRDC         0/7
299: Mutually_Exclusive SPATA31H1, ZFHX4         0/8
300: Mutually_Exclusive ITPR3, SPATA31H1         0/7
## NCB
somaticInteractions(maf = laml_NCB, top = 25, pvalue = c(0.05, 0.1))

        gene1    gene2       pValue oddsRatio    00    01    11    10
       <char>   <char>        <num>     <num> <int> <int> <int> <int>
  1:   LILRA6   GTPBP6 2.388284e-07       Inf    52     0     5     0
  2:   GTPBP6 PRAMEF18 5.015397e-06       Inf    50     2     5     0
  3:   LILRA6 PRAMEF18 5.015397e-06       Inf    50     2     5     0
  4:   LILRB3     TP53 7.623404e-05   0.00000    12    37     0     8
  5: PRAMEF18   LILRB3 2.543101e-04  33.58248    47     3     5     2
 ---                                                                 
296:     MTOR   LILRA6 1.000000e+00   0.00000    47     5     0     5
297:   PLXNB2   LILRA6 1.000000e+00   0.00000    47     5     0     5
298:   SPTBN4   LILRA6 1.000000e+00   0.00000    47     5     0     5
299:   PLXNB2     MTOR 1.000000e+00   0.00000    47     5     0     5
300:   SPTBN4     MTOR 1.000000e+00   0.00000    47     5     0     5
             pAdj              Event             pair event_ratio
            <num>             <char>           <char>      <char>
  1: 5.528436e-06       Co_Occurence   GTPBP6, LILRA6         5/0
  2: 1.011169e-04       Co_Occurence GTPBP6, PRAMEF18         5/2
  3: 1.011169e-04       Co_Occurence LILRA6, PRAMEF18         5/2
  4: 1.443826e-03 Mutually_Exclusive     LILRB3, TP53        0/45
  5: 4.541253e-03       Co_Occurence LILRB3, PRAMEF18         5/5
 ---                                                             
296: 1.000000e+00 Mutually_Exclusive     LILRA6, MTOR        0/10
297: 1.000000e+00 Mutually_Exclusive   LILRA6, PLXNB2        0/10
298: 1.000000e+00 Mutually_Exclusive   LILRA6, SPTBN4        0/10
299: 1.000000e+00 Mutually_Exclusive     MTOR, PLXNB2        0/10
300: 1.000000e+00 Mutually_Exclusive     MTOR, SPTBN4        0/10
#KelseyNote

I’m not sure how meaningful this analysis is.

Something to be cautious of, especially in the CB/NCB analysis, is that the cohort is enriched for Serous histology but that may not be evenly distributed between CB/NCB groups (should test this).

So we could be picking up on histological differences that we mistake for real response signals.

Pfam domains

The pfamDomains function performs domain enrichment analysis, grouping protein domains to identify the most dysregulated pathways and protein families involved in similar functions.

pfamDomains summarizes amino acid positions and annotates them with pfam domain info.

It returns two tables summarized by 1) amino acid positions and 2) domains.

It also plots the top most mutated domains as a scatter plot.

# Scatter plot with the top 10 mutated domains
laml.pfam = pfamDomains(maf = laml_tum, top = 10)

# Amino acid positions 
proteinhead <- head(laml.pfam$proteinSummary)
gt(proteinhead)
HGNC AAPos Variant_Classification N total fraction DomainLabel pfam Description
LILRA6 88 Missense_Mutation 4 6 0.66666667 Ig1_LILRB1_like cd05751 First immunoglobulin (Ig)-like domain found in Leukocyte Ig-like receptors (LILR)B1 (also known as LIR-1) and similar proteins
PPIAL4D 108 Missense_Mutation 4 4 1.00000000 cyclophilin_ABH_like cd01926 cyclophilin_ABH_like: Cyclophilin A, B and H-like cyclophilin-type peptidylprolyl cis- trans isomerase (PPIase) domain. This family represents the archetypal cystolic cyclophilin similar to human cyclophilins A, B and H. PPIase is an enzyme which...
CYP2D6 486 Missense_Mutation 3 6 0.50000000 p450 pfam00067 Cytochrome P450
DND1 68 Missense_Mutation 3 3 1.00000000 RRM cd00590 RRM (RNA recognition motif), also known as RBD (RNA binding domain) or RNP (ribonucleoprotein domain), is a highly abundant domain in eukaryotes found in proteins involved in post-transcriptional gene expression processes including mRNA and rRNA...
IGF2BP3 503 Missense_Mutation 3 4 0.75000000 KH-I cd00105 K homology RNA-binding domain, type I. KH binds single-stranded RNA or DNA. It is found in a wide variety of proteins including ribosomal proteins, transcription factors and post-transcriptional modifiers of mRNA. There are two different KH domains that...
TP53 278 Missense_Mutation 3 59 0.05084746 P53 cd08367 P53 DNA-binding domain
# Domains
domainhead <- head(laml.pfam$domainSummary)
gt(domainhead)
DomainLabel nMuts nGenes pfam Description
7tm_1 142 124 pfam00001 7 transmembrane receptor (rhodopsin family)
COG5048 108 91 NA FOG: Zn-finger [General function prediction only]
Cadherin_repeat 89 50 cd11304 Cadherin tandem repeat domain
FN3 56 46 One of three types of internal repeats found in the plasma protein fibronectin. Its tenth fibronectin type III repeat contains an RGD cell recognition sequence in a flexible loop between 2 strands. Approximately 2% of all... Fibronectin type 3 domain
P53 53 1 cd08367 P53 DNA-binding domain
COG2319 52 43 NA FOG: WD40 repeat [General function prediction only]

Here are the results stratified by histology

# Serous
laml.pfam.ser = pfamDomains(maf = laml_ser, top = 10)

# Non-Serous
laml.pfam.other = pfamDomains(maf = laml_other, top = 10)

Here they are stratified by response

# CB
laml.pfam.CB = pfamDomains(maf = laml_CB, top = 10)

# CB Domains
domainhead_CB <- head(laml.pfam.CB$domainSummary, n=10)
gt(domainhead_CB)
DomainLabel nMuts nGenes pfam Description
7tm_1 46 45 pfam00001 7 transmembrane receptor (rhodopsin family)
Cadherin_repeat 34 29 cd11304 Cadherin tandem repeat domain
COG5048 32 29 NA FOG: Zn-finger [General function prediction only]
WD40 20 19 typically contains a GH dipeptide 11-24 residues from... WD40 domain, found in a number of eukaryotic proteins that cover a wide variety of functions including adaptor/regulatory modules in signal transduction, pre-mRNA processing and cytoskeleton assembly
COG2319 19 18 NA FOG: WD40 repeat [General function prediction only]
FN3 19 18 One of three types of internal repeats found in the plasma protein fibronectin. Its tenth fibronectin type III repeat contains an RGD cell recognition sequence in a flexible loop between 2 strands. Approximately 2% of all... Fibronectin type 3 domain
P53 18 1 cd08367 P53 DNA-binding domain
LamG 15 14 Laminin G-like domains are usually Ca++ mediated receptors that can have binding sites for steroids, beta1 integrins, heparin, sulfatides, fibulin-1, and alpha-dystroglycans. Proteins that contain LamG domains serve a variety of... Laminin G domain
ANK 14 13 ankyrin repeats mediate protein-protein interactions in very diverse families of proteins. The number of ANK repeats in a protein can range from 2 to over 20 (ankyrins, for example). ANK repeats may occur in combinations with other... ankyrin repeats
I-set 14 13 pfam07679 Immunoglobulin I-set domain
# NCB
laml.pfam.NCB = pfamDomains(maf = laml_NCB, top = 10)

# NCB Domains
domainhead_NCB <- head(laml.pfam.NCB$domainSummary, n=10)
gt(domainhead_NCB)
DomainLabel nMuts nGenes pfam Description
7tm_1 96 85 pfam00001 7 transmembrane receptor (rhodopsin family)
COG5048 76 63 NA FOG: Zn-finger [General function prediction only]
Cadherin_repeat 55 35 cd11304 Cadherin tandem repeat domain
FN3 37 31 One of three types of internal repeats found in the plasma protein fibronectin. Its tenth fibronectin type III repeat contains an RGD cell recognition sequence in a flexible loop between 2 strands. Approximately 2% of all... Fibronectin type 3 domain
P53 35 1 cd08367 P53 DNA-binding domain
COG2319 33 27 NA FOG: WD40 repeat [General function prediction only]
I-set 26 22 pfam07679 Immunoglobulin I-set domain
S_TKc 24 21 smart00220 Serine/Threonine protein kinases, catalytic domain
Ig 23 18 cl11960 Immunoglobulin domain
ANK 21 18 ankyrin repeats mediate protein-protein interactions in very diverse families of proteins. The number of ANK repeats in a protein can range from 2 to over 20 (ankyrins, for example). ANK repeats may occur in combinations with other... ankyrin repeats
#KelseyNote

The domains are largely similar between CB and NCB, but CB has WD40 in the top 4, but it’s #20 for NCB.

LamG is also only in CB top 10 (#13 for NCB), while S_TKc and Ig are in NCB top 10 (Ig is #19 for CB; S_TKc not in CB top 20).

Again, not sure how meaningful this kind of analysis is.

Survival Analysis

#Survival analysis based on grouping of TP53 mutation status
mafSurvival(maf = laml_tum, genes = 'TP53', time = 'Survtime', Status = 'Survstat')
TP53 
  59 
    Group medianTime     N
   <char>      <num> <int>
1: Mutant      26.84    59
2:     WT      14.19    27

# This...looks wrong. I don't think the events are being recognized correctly.